
    How Much Speech Data Is Needed for Tracking Language Change in Alzheimer’s Disease? A Comparison of Random Length, 5-Min, and 1-Min Spontaneous Speech Samples

    Introduction: Changes in speech can act as biomarkers of cognitive decline in Alzheimer's disease (AD). While shorter speech samples would facilitate data collection and analysis, the minimum length of informative speech samples remains debated. This study aims to provide insight into the effect of sample length when analyzing longitudinal recordings of spontaneous speech in AD by comparing the original random-length samples with 5- and 1-minute-long samples. We ask whether capping the audio improves the accuracy of the analysis and whether the extra 4 minutes convey necessary information.
    Methods: 110 spontaneous speech samples were collected from decades of YouTube videos of 17 public figures, 9 of whom eventually developed AD. 456 language features were extracted, and their text-length sensitivity, comparability, and ability to capture change over time were analyzed across the three sample lengths.
    Results: Capped audio files had advantages over the random-length ones. While most extracted features were statistically comparable or highly correlated across the datasets, potential effects of sample length should be acknowledged for some features. The 5-minute dataset was the most reliable for tracking the evolution of the disease, suggesting that the 4 extra minutes do convey informative data.
    Conclusion: Sample length appears to play an important role in extracting language feature values from speech and in tracking disease progress over time. We highlight the importance of further research into optimal sample length and of standardizing methods when studying speech in AD.
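    As a rough illustration of the length comparison described above, the sketch below caps hypothetical word-timestamped transcripts at 5 and 1 minutes, computes one illustrative length-sensitive feature (type-token ratio), and correlates the capped values with the uncapped ones. The transcript format, the feature choice, and the correlation measure are assumptions for illustration; the study's actual 456 features and pipeline are not reproduced here.

```python
# Hypothetical sketch: compare one lexical feature across capped sample lengths.
# Assumes word-timestamped transcripts as lists of {"token", "start"} dicts;
# this is not the paper's pipeline, only an illustration of the comparison.
from scipy.stats import spearmanr

def cap_transcript(words, max_seconds):
    """Keep only words whose start time falls within the first max_seconds."""
    if max_seconds is None:          # random/original length: no capping
        return [w["token"] for w in words]
    return [w["token"] for w in words if w["start"] < max_seconds]

def type_token_ratio(tokens):
    """One illustrative lexical-diversity feature (known to be length-sensitive)."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def compare_lengths(samples):
    """Compute the feature at three lengths per sample and report how strongly
    the capped values correlate with the uncapped (random-length) values."""
    full = [type_token_ratio(cap_transcript(s, None)) for s in samples]
    five = [type_token_ratio(cap_transcript(s, 300)) for s in samples]
    one = [type_token_ratio(cap_transcript(s, 60)) for s in samples]
    rho5, _ = spearmanr(full, five)
    rho1, _ = spearmanr(full, one)
    return {"rho_5min_vs_full": rho5, "rho_1min_vs_full": rho1}
```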

    Multi-SimLex: A Large-Scale Evaluation of Multilingual and Cross-Lingual Lexical Semantic Similarity

    We introduce Multi-SimLex, a large-scale lexical resource and evaluation benchmark covering data sets for 12 typologically diverse languages, including major languages (e.g., Mandarin Chinese, Spanish, Russian) as well as less-resourced ones (e.g., Welsh, Kiswahili). Each language data set is annotated for the lexical relation of semantic similarity and contains 1,888 semantically aligned concept pairs, providing representative coverage of word classes (nouns, verbs, adjectives, adverbs), frequency ranks, similarity intervals, lexical fields, and concreteness levels. Additionally, owing to the alignment of concepts across languages, we provide a suite of 66 cross-lingual semantic similarity data sets. Due to its extensive size and language coverage, Multi-SimLex provides entirely novel opportunities for experimental evaluation and analysis. On its monolingual and cross-lingual benchmarks, we evaluate and analyze a wide array of recent state-of-the-art monolingual and cross-lingual representation models, including static and contextualized word embeddings (such as fastText, monolingual and multilingual BERT, and XLM), externally informed lexical representations, and fully unsupervised as well as (weakly) supervised cross-lingual word embeddings. We also present a step-by-step protocol for creating consistent, Multi-SimLex-style resources for additional languages. We make these contributions available via a website: the public release of the Multi-SimLex data sets, their creation protocol, strong baseline results, and in-depth analyses that can help guide future developments in multilingual lexical semantics and representation learning. The website is intended to encourage a community effort to extend Multi-SimLex to many more languages, and such a large-scale semantic resource could inspire significant further advances in NLP across languages.
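    Benchmarks in the SimLex family are typically scored by correlating model-predicted similarities with the human ratings. The sketch below illustrates that protocol for static word vectors, using cosine similarity and Spearman's rho; the file names and the tab-separated pair format are assumptions for illustration, not the released Multi-SimLex layout.

```python
# Minimal sketch of a SimLex-style evaluation: cosine similarity of static
# word vectors vs. human similarity ratings, scored with Spearman's rho.
# File names and the (word1, word2, score) TSV layout are assumptions.
import numpy as np
from scipy.stats import spearmanr

def load_vectors(path, limit=200_000):
    """Load word vectors from a fastText-style .vec text file."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        next(f)  # skip the header line (vocabulary size, dimension)
        for i, line in enumerate(f):
            if i >= limit:
                break
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def evaluate(pairs_path, vectors):
    """Spearman correlation between model similarities and human scores,
    skipping out-of-vocabulary pairs (coverage should also be reported)."""
    model_sims, human_sims = [], []
    with open(pairs_path, encoding="utf-8") as f:
        for line in f:
            w1, w2, score = line.rstrip().split("\t")
            if w1 in vectors and w2 in vectors:
                model_sims.append(cosine(vectors[w1], vectors[w2]))
                human_sims.append(float(score))
    rho, _ = spearmanr(model_sims, human_sims)
    return rho
```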